ABSTRACT


This paper presents research and provides a method to ensure that parallel assessments generated
from a large test-item database maintain equitable difficulty and content coverage each time the
assessment is presented. To maintain fairness and validity, it is important that all instances of an
assessment intended to test the same subject content are presented to each test-taker without bias.
The method described demonstrates how each test-item in an item bank¹ is (i) assigned a difficulty
rating using a recognized test-centered², legally defensible cut-score³ determination method,
(ii) assigned to a hard, moderate, or easy category, and (iii) then selected in a stratified random
selection⁴ fashion, by content and difficulty, to ensure equal content coverage and a difficulty
level within an acceptable range of the calculated cut score.


1 An item bank is an organized collection of items. 

2 Test-centered methods rely on judgments about test-items, whereas examinee-centered methods rely on
judgments about examinees.

3 To be legally defensible and meet the Standards for Educational and Psychological Testing, a cut score cannot be
arbitrarily determined; it must be empirically justified.

4 Stratified sampling divides the population or data into groups or blocks. Random samples or selections are taken
from each group or block.
EXECUTIVE SUMMARY


With the proliferation of computer-based, online, and paper-based assessment/testing platforms, the
storage of test-items (questions) in a database has become commonplace. Without proper preparation
of the database, which includes the initial assignment of key fields within the structure of the
database, a danger of unfair assessment/testing exists. This unfairness is created by randomly
selecting test-items from a database without regard to item difficulty or to coverage of the content
that supports the objectives or competencies being assessed. The unfairness is amplified when items
are randomly selected from a database to produce parallel forms⁵ of assessments/tests to be given
simultaneously to a group or as make-up tests or retests.


The purpose of this research study is: 


• to present evidence that when test-items are selected at random from a database, the
resulting parallel forms of the assessments/tests will NOT:
   o cover content equally with each iteration of a parallel assessment/test
   o be equivalent in regard to difficulty of test-items
• to present evidence that when using stratified random selection of test-items from a
database, the resulting parallel forms of the assessments/tests will:
   o cover all content equally and at the same difficulty with each iteration of a parallel
   assessment/test
   o maintain an overall assessment/test difficulty within an acceptable range of the
   calculated cut score


The paper provides a brief background on test development and procedures for establishing
defensible cut or passing scores. Three experiments were conducted using both hypothetical and
real client test-item difficulty data. The test-item data were entered into a spreadsheet tool
developed by James R. Parry that:


• calculates item difficulty based upon the results of a cut-score rating session and assigns a
rating of hard, moderate, or easy,

• calculates a cut or passing score for the entire database, and

• provides a final assessment/test design criterion that will maintain equality in both content
coverage and difficulty for parallel test forms.


The overall findings conclude that pure random selection of test items will not produce fair parallel 
forms of assessments/tests but stratified random selection will. 


5 Parallel forms are different versions of a test that measure the same objectives and yield similar results (Shrock
& Coscarelli, 2007).
Recommendations of the study are:


In order to maintain fairness and ensure parallel forms of assessments/tests are valid, reliable, 
without bias and defensible when generated from a test-item database: 


• Test-items must be constructed using universally recognized standards

• Cut scores should be established using a recognized test-centered method or, if appropriate,
a test-taker centered method, because arbitrary methods are not defensible

• Each item in a test-item database should be evaluated by a panel of expert judges, and a
difficulty score or rating established based upon the agreed-upon minimum acceptable
competence (MAC) level of the target test-taker

• Test-items should be selected using stratified randomization based upon both topic coverage
and item difficulty to ensure equitable parallel assessments are generated

• Tests should not be generated in a pure random fashion from a test-item database without
regard to content because content coverage will be erratic

• Tests should not be generated in a pure random fashion from a test-item database without
regard to difficulty of test-items because difficulty among tests will be erratic


Copyright © 2020 by James R. Parry 


TABLE OF CONTENTS 


ABSTRACT
EXECUTIVE SUMMARY
INTRODUCTION
RESEARCH QUESTIONS
THE TESTING PROCESS
TEST DESIGN
ESTABLISHING A DEFENSIBLE CUT SCORE OR DIFFICULTY RATING
    Angoff/Modified Angoff Method
ITEM DATABASES
EXPERIMENTAL PROCEDURES
    Basic description of the Questionmark OnDemand assessment platform:
    Design philosophy of the Compass Consultants, LLC spreadsheet tool:
    Experiment #1A - Random Selection - Hypothetical Data
    Experiment #1B - Stratified Randomization - Hypothetical Data
    Experiment #2A - Random Selection - Real Client #1 Data
    Experiment #2B - Stratified Randomization - Real Client #1 Data
    Experiment #3A - Random Selection - Real Client #2 Data
    Experiment #3B - Stratified Randomization - Real Client #2 Data
CONCLUSIONS
RECOMMENDATIONS
LIST OF FIGURES
LIST OF TABLES
APPENDIX A
    Description of the spreadsheet tool:
    Design philosophy of the Compass Consultants, LLC spreadsheet tool:
    Function of the Spreadsheet Tool
LIST OF FIGURES IN APPENDIX A
LIST OF TABLES IN APPENDIX A
REFERENCES
ACKNOWLEDGEMENTS
ABOUT THE AUTHOR




INTRODUCTION 


Assessments are important evaluation tools used in educational, industrial, government, medical,
military, and other organizations throughout the world. The proliferation of electronic/computer
databases used to store test-items and generate randomized assessments brings a considerable
possibility of unfairness. When assessments are intended to be parallel, this unfairness can be
in (i) difficulty, i.e., some assessments are more difficult than others, or (ii) content, i.e.,
more or fewer items are selected from one or more topics in different instances of the assessment.
If parallel assessments are not equal in both difficulty and content, both the test-taker and the
testing organization are at risk. An unfair assessment that consists mostly of “easy” test-items may
indicate that a minimally competent candidate is qualified when, in fact, they just “got lucky”,
whereas an assessment consisting mostly of “hard” test-items, on the same topic or topics as the
easy assessment, may deny a minimally qualified or competent candidate a position or
promotion that they truly deserve. Alternatively, what if, because of random item selection, all
instances of an assessment do not test the same content? This would support the incorrect assumption
that all candidates “knew” all of the required objectives or competencies when, in fact, they were
not tested on all requirements.


Example: 


A hypothetical 20-item end of unit assessment on electrical safety is intended to test four topics of 
equal importance: 


a) Grounding 

b) Lock-out-tag-out 

c) Personal safety equipment 
d) Insulation 


Because all topics are considered to be of equal importance, we assume that five test-items should
be presented for each topic. The test-item database contains 100 test-items pertaining to electrical
safety, not sorted by topic or by difficulty. The assessment is administered via computer, with each
student receiving a randomly generated version with a selection criterion of “select 20 questions
at random from topic ‘electrical safety’”.


Think about all of the possible outcomes of this item selection criterion in terms of both test
difficulty and content and the problems associated with this method. With the possibility of generating
an unfair assessment, why would test administrators even want to generate a random assessment
instead of relying on a single fixed form⁶?
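
As a rough illustration of these outcomes, the selection criterion can be simulated with a short Python sketch. It assumes, purely for illustration, that the 100 items are split evenly at 25 per topic (the example does not state the split), and the names BANK, draw_random_form, and topic_counts are hypothetical.

```python
import random
from collections import Counter

TOPICS = ["Grounding", "Lock-out-tag-out", "Personal safety equipment", "Insulation"]

# Assumed bank layout: 100 items, 25 per topic (not stated in the example).
BANK = [(topic, n) for topic in TOPICS for n in range(25)]


def draw_random_form(rng, size=20):
    """Pure random selection: pick `size` items with no regard to topic."""
    return rng.sample(BANK, size)


def topic_counts(form):
    """Count how many items on a form come from each topic."""
    counts = Counter(topic for topic, _ in form)
    return {topic: counts.get(topic, 0) for topic in TOPICS}


if __name__ == "__main__":
    rng = random.Random(2020)  # fixed seed so a run can be repeated
    for i in range(5):
        print(f"Form {i + 1}:", topic_counts(draw_random_form(rng)))
```

Each form always holds 20 items, but nothing in the rule forces the intended five items per topic, so the per-topic counts drift from form to form. The same reasoning applies to difficulty, which this sketch does not model.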


6 A fixed-form assessment asks all test-takers to respond to the same questions or tasks in the same order.
There are several advantages to using randomization to select test-items from a database instead
of a fixed form, including:


• Anti-cheating: test-takers do not all receive the same test-items, so looking at the computer
screen or paper of a nearby test-taker will not present an advantage

• Ability to add or retire⁷ test-items: new test-items can be added to, or retired from, the item
database as references or competencies change

• Ability to administer the assessment to the same test-taker multiple times with
different test-items

• Ability to deliver the assessment at different times without fear of pre-test
communications among test-takers (e.g., a test-taker on the East coast of the U.S. calling a
test-taker on the West coast of the U.S. about the test contents)


An alternative to pure random selection of test-items to generate parallel forms is to use a fixed
form but randomize the order of both the test-items and the alternatives within the assessment/test.
This will produce parallel forms each time and maintain both content and difficulty fairness.
Disadvantages of this method are possible overexposure of items to test-takers and the inability to
produce make-up tests or retests that present alternative test-items. Because the stem of each
test-item remains unchanged, there is still a danger of pre-test compromise due to communication
among test-takers.


So, what are the dangers of using pure random selection to generate assessments/tests? 


• Randomization may produce measurement error⁸ in that some test-takers may be presented
with a difficult assessment/test and others may receive an easy one. When an assessment/test
is used to classify test-takers into groups, two kinds of wrong decisions can occur
(Livingston & Zieky, 1982):

   o A test-taker who actually belongs in the lower group can get a score above the
   passing score

   o A test-taker who actually belongs in the higher group can get a score below the
   passing score


Classifying someone into the wrong group could lead to less-than-qualified individuals being
promoted or certified, or to qualified individuals being denied advancement or certification. These
situations could lead to legal challenges that may be difficult to defend.


The item selection method that I propose to alleviate possible unfairness when using randomization 
to select test-items is described in detail in this paper. 


7 Once a test-item has been used on an actual assessment, it should never be deleted entirely from an item database
due to the possibility of future legal proceedings. Retiring the item (if available in the software) retains the item
and all statistics in their original form.

8 Measurement error in education generally refers to either (1) the difference between what a test score indicates
and a student’s actual knowledge and abilities or (2) errors that are introduced when collecting and calculating
data-based reports, figures, and statistics related to schools and students (Great Schools Partnership, 2013).
RESEARCH QUESTIONS


1. When test-items that have been assigned a difficulty rating using a recognized test-centered,
empirically justified procedure and placed in appropriate topic classification areas at
various difficulty levels (easy, moderate, or hard) are randomly selected from a
computer-based test-item bank, will every iteration of the test generated be presented at a
similar difficulty level and cover all topics within objectives adequately?

2. When test-items that have been assigned a difficulty rating using a recognized test-centered,
empirically justified procedure and placed in appropriate topic classification areas at
various difficulty levels (easy, moderate, or hard) are selected in a stratified-random manner
from a computer-based test-item bank, will every iteration of the test generated be presented
at a similar difficulty level and cover all topics within objectives adequately?


Before any item selection process is attempted it is important to design both effective test-items 
and test instruments. I will begin with a brief overview of the testing process and design that is the 
basis for the stratified random selection method that I propose. 


THE TESTING PROCESS 
I begin with a quote from Steven M. Downing, University of Illinois at Chicago: 


“Effective test development requires a systematic, well-organized approach to ensure
sufficient validity evidence to support the proposed inferences from test scores.”


(Downing, 2006) 


The testing process has many stakeholders, who may include members of management,
administrators, facilitators, instructors, teachers, test administrators, etc., but the most important
stakeholder is the test-taker. If the test is unfair or biased, the test-taker is placed at a disadvantage.


This leads to another quote: 


“All tests should be well developed and testing practices, beneficial. There is extensive 
evidence documenting the effectiveness of well-constructed tests in relation to supporting 
the validity of the test. The proper use of tests can result in making wiser decisions about 
individuals and programs than those made without using tests. The improper use of tests, 
however, can cause considerable harm to test-takers and others affected by test-based
decisions.”


(AERA; APA; NCME, 2014) 
Whether an assessment/test is norm-referenced⁹ or criterion-referenced¹⁰, it must be fair to all
test-takers. A test administered to one participant or to many must be developed and constructed
to ensure that it is both valid (it measures what it is supposed to measure) and reliable (it
measures consistently).


TEST DESIGN 


The first and most important step in creating fair, defensible assessments or tests is to ensure
validity. All test-items must match the required job skills or certification requirements. Referring
to the Designing Criterion-Referenced Tests flow chart (Figure 1) as a guidance tool, note that
test-item design comes right after the job analysis phase, before course material is created.


[Flow chart: Designing Criterion-Referenced Tests. Adapted from Shrock & Coscarelli (2007), page 45, Figure 3.1.]


Figure 1 — Designing Criterion-Referenced Tests 


The next big step is to determine how many items to develop. To solve this dilemma, ask the 
following questions: 


• How critical are decisions based upon the results of the test?

• What resources (time, money, and personnel) are available for testing?

• How big is the overall objective that is being tested?

• How closely related are the objectives that are being tested?


Table 1, which was adapted from Criterion-Referenced Test Development, Technical and Legal 
Guidelines for Corporate Training (Shrock & Coscarelli, 2007), provides a starting point to help 
determine the number of test-items required per objective being tested. Once the number of items 


9 A norm-referenced test compares people in relation to the test performance of one another. It is generally
composed of items that will separate the scores of test-takers from one another and is typically used to rank-order
or select top performers.

10 A criterion-referenced test compares people to a standard. It is composed of items based on specific objectives
or competencies. It defines the performance of each test-taker without regard to the performance of others. It is a
test of mastery.
that are required to adequately test an objective has been determined, based on the table, I suggest
that the number be multiplied by three to five to allow for an adequate number of items to be
available for stratified randomized selection from a test-item database. This ensures that enough
items are available in each topic to generate several iterations of a parallel assessment without
repeating the same items.


Table 1 - Number of Test-Items per Objective


Typically, there are several topics within an objective, some that are considered more important or 
critical than other topics. Table 1 can be used as a guide to assist in this decision process. An 
example may be derived from the hypothetical 20-question end of unit assessment on electrical 
safety presented in the introduction to this paper. This assessment is designed to test four topics of 
equal importance: 


a) Grounding 

b) Lock-out-tag-out 

c) Personal safety equipment 
d) Insulation 


Because the topics are considered to be of equal importance, 25% of the test-item database is
dedicated to each of the topics. To allow stratified randomization to generate a 20-item
assessment, the number of items available in the database for each topic would be as follows:


a) Grounding                  15 - 25
b) Lock-out-tag-out           15 - 25
c) Personal safety equipment  15 - 25
d) Insulation                 15 - 25


This requires a test-item database of between 60 and 100 items to provide five items for each
topic on each assessment and ensure different but equivalent items on each randomly
generated assessment.
If all topics are not considered to be equally important or critical, then the number of items
available per topic, within each objective, would be adjusted. For example, if Grounding is
considered the most important topic, it may be weighted as 50% of the assessment, with
Lock-out-tag-out second at 30% and both Personal safety equipment and Insulation equally important
at 10% each. This would require each assessment to be planned and generated as illustrated in
Table 2.


Topic                      % of Test   Number on 20 Item Test   Number Available in Database
Grounding                  50          10                       30 - 50
Lock-out-tag-out           30          6                        18 - 30
Personal safety equipment  10          2                        6 - 10
Insulation                 10          2                        6 - 10


Table 2 - Topic Weights for Electrical Safety Assessment 
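
The arithmetic behind Table 2 can be reproduced with a small Python sketch. It applies the three-to-five-times multiplier suggested earlier in this section; the function name plan_topics and the choice to reject weights that do not yield whole item counts are illustrative assumptions, not part of the spreadsheet tool.

```python
def plan_topics(weights_pct, test_length=20, pool_multipliers=(3, 5)):
    """Return {topic: (items per test, minimum pool, maximum pool)}.

    The 3x to 5x pool multipliers follow the suggestion in the text;
    everything else about this helper is illustrative.
    """
    if sum(weights_pct.values()) != 100:
        raise ValueError("topic weights must sum to 100%")
    low, high = pool_multipliers
    plan = {}
    for topic, pct in weights_pct.items():
        per_test, remainder = divmod(test_length * pct, 100)
        if remainder:
            raise ValueError(f"{pct}% of {test_length} items is not a whole number")
        plan[topic] = (per_test, per_test * low, per_test * high)
    return plan


WEIGHTS = {
    "Grounding": 50,
    "Lock-out-tag-out": 30,
    "Personal safety equipment": 10,
    "Insulation": 10,
}

if __name__ == "__main__":
    for topic, (per_test, low, high) in plan_topics(WEIGHTS).items():
        print(f"{topic}: {per_test} per test, {low} - {high} in database")
```

Passing equal weights of 25% per topic returns 5 items per test and a pool of 15 to 25 items for each topic, which matches the equal-importance case.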


As Table 2 shows, test-item databases can become quite large to support randomization. This seems
fairly straightforward so far, but what about difficulty? It is important to maintain a good
balance of easy, moderate, and hard items to select from and to include on each test or assessment.
When the test-item authors, in consultation with subject matter experts (SMEs), are building the
items, they will generally have an idea of the difficulty of each item, with more complex items
being more difficult than simple items in most cases. The actual difficulty will not be confirmed
until a cut-score rating session is convened.


Another consideration is test length vs. time available. What if, using Table 1 as a guide, the test
length is determined to be 100 questions? How much time is available for testing? Research and
best practice indicate that the typical response time for each test-item on a written test, without
references provided, ranges from 42 seconds for an easy item to 90 seconds for a very difficult item.
Table 3, based on information in a paper by Phil Higgins (Higgins, 2009), provides a tool to assist
in determining the time to be allotted for administration of an assessment, not including administrative
time (attendance, paperwork, reading test procedures/rules, etc.). Referring to Table 3, a typical
100-item, 4-alternative multiple-choice assessment that is considered to be of moderate difficulty
would require approximately 5,500 seconds, or about 1 hour 32 minutes, to complete, plus
administrative time, for a total estimated time of about 2 hours. The table is useful as an initial
planning tool. Using stratified randomization, which defines the actual number of easy, moderate,
and hard items selected for each iteration of the assessment, could further refine initial estimates
of the test time allotted (e.g., a 50-item assessment consisting of 20 easy, 20 moderate, and 10 hard
test-items would yield an approximate test time of 43 minutes). The time it takes an individual
test-taker to respond to each item is affected by several factors including, but not limited to, language
comprehension, education level, reading ability, previous exposure to the test-item, test-taker
motivation, test-taker fatigue, unfamiliarity with the test administration system, test anxiety, and
disabilities. A possible problem with rule-of-thumb time estimates is that their legality may be
challenged unless there is reliable data to back up the estimate. How can the test designer prove
that the time allotted is sufficient? A review of statistical information on response time by item,
which is typically available with online test administration software suites, will provide a more
accurate estimate once the assessments have been in use.


Time Allotted per Test Item

Overall Test or
Item Difficulty       Easy                   Moderate               Hard

Time per test item    43 seconds             55 seconds             61 seconds

50 Item Test Time     2150 seconds           2750 seconds           3050 seconds
                      (approx. 36 minutes)   (approx. 46 minutes)   (approx. 51 minutes)

100 Item Test Time    4300 seconds           5500 seconds           6100 seconds
                      (approx. 72 minutes)   (approx. 92 minutes)   (approx. 102 minutes)

Table 3 - Test Time Tool

It is interesting to note that the U.S. Coast Guard found that if references are allowed to be used
during the test, the average time spent per item increases by about 36% (United States Coast Guard,
2015). If the time required is excessive, consideration must be given to omitting some topics or
rethinking the purpose of the test outcome:


• Are the objectives being tested covering too large of a content domain?

• Are all of the topics being tested actually required and supported by the objectives or
certification requirements?

• Are the “topics” too far “down in the weeds”? In other words, is it a step of a topic at a
higher level? (e.g., unscrewing a light bulb is a step in a larger topic of changing a light
bulb).


ESTABLISHING A DEFENSIBLE CUT SCORE OR DIFFICULTY RATING 


Establishing a cut score on an assessment or test is also known as standard setting. The ‘standard’
may be in the form of pass/fail, issue/non-issue of a license or certification, award/withhold a
credential, etc. To be legally defensible, the cut score cannot be established arbitrarily;
it must be empirically justified¹¹ (AERA; APA; NCME, 2014). There are several recognized methods,
both test-centered and test-taker centered¹², for establishing defensible cut scores. Gregory J. Cizek
(Cizek, 2006) has done a great job describing and summarizing many of these in his paper Standard
Setting. One thing to keep in mind, according to the Standards for Educational and Psychological
Testing, is that “There can be no single method for determining cut scores for all tests or for all
purposes, nor can there be any single set of procedures for establishing their defensibility” (AERA;
APA; NCME, 2014, p. 100).


Most, if not all, of the recognized methods for establishing a cut score are designed around fixed 
form tests or assessments where the test-items are selected manually and a cut score established 
on the contents of a single test. If a second, third or subsequent test is generated the cut score 


11 Something empirically justified can be provable or verifiable by experience or experiment (Dictionary.com 
Unabridged, 2020) 
12 Test-taker centered methods require judges to make decisions based on their knowledge of the examinees and
their performance.
would typically vary because each iteration of the test would be judged individually. The method
I propose is a variation of the Modified Angoff Method that evaluates the entire database as a
whole, establishes a cut score and difficulty rating for all items within the database, and produces
a recommended stratified randomized test design that maintains the cut score and covers
all topics equally and at similar difficulty with each iteration of a test. The Angoff/Modified
Angoff Method with my variations is described as follows:


Angoff/Modified Angoff Method 

The Angoff method of determining and setting a cut score for an assessment uses subject matter
experts (SMEs) as judges to review each item and assign a score or weight to the item based on
the judge’s estimate of the likelihood that a minimally competent performer, i.e., a test-taker who
is at the minimum acceptable competence (MAC) level required for the job or certification, would
answer the item correctly. It should be noted that the Angoff Method is sensitive to both difficulty
and importance. Thus, for a welder, putting on safety glasses is easy, but all welders would be
expected to be able to perform this task, while landing a plane in wind shear is difficult, but all
pilots would also be expected to pass this task (Coscarelli, Barrett, Kleeman, & Shrock, 2005).
These scores, each expressed as the proportion of 100 MAC-level test-takers expected to answer
correctly, are then summed and averaged to assign a difficulty rating to each item. For example, six
judges score an item as .65, .70, .65, .75, .60, and .70. The average of the scores is .675, or 67.5%,
which translates to an estimate that 67.5% of test-takers at the MAC level would respond correctly
to the item. The Angoff Method establishes both a floor and a ceiling score for each item. Typically,
the floor score (the lowest a judge is permitted to score an item) is based on the number of
alternatives for the test-taker to choose from, and the ceiling is based on the average test scores of
the judges, who are considered to be experts. The ceiling cannot be higher than the judges’ average
scores because a test-taker at the MAC level could not be expected to respond correctly if an expert
cannot. The Angoff Method, as with most test-centered methods, is designed to rate the test-items
on a single form of a test, so, in theory, each form of the same assessment, all testing the same
content, could have a different cut score because each item has a unique Angoff weight (Coscarelli,
Barrett, Kleeman, & Shrock, 2005) (e.g., test form A has a cut score of 82%, test form B has a cut
score of 80%, and test form C has a cut score of 85%). Each of the cut scores would probably be
legally defensible because a recognized standard setting method was used to set the scores, but the
testing organization may have difficulty explaining the score differences both internally and
externally.
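
The averaging step, and the way per-form cut scores can drift apart, can be sketched in Python as follows. The sketch assumes the conventional Angoff computation in which a form's cut score is the mean of its item ratings; the three-item forms are hypothetical values chosen only to mirror the 82%, 80%, and 85% example, and the function names are illustrative.

```python
from statistics import mean


def item_rating(judge_scores):
    """Average the judges' estimates for one item (each between 0 and 1)."""
    return mean(judge_scores)


def form_cut_score(item_ratings):
    """Cut score of one fixed form, taken here as the mean of its item ratings."""
    return mean(item_ratings)


if __name__ == "__main__":
    rating = item_rating([0.65, 0.70, 0.65, 0.75, 0.60, 0.70])
    print(f"item rating: {rating:.1%}")  # item rating: 67.5%

    # Hypothetical three-item forms; real forms would hold many more items.
    forms = {
        "A": [0.90, 0.80, 0.76],
        "B": [0.85, 0.75, 0.80],
        "C": [0.95, 0.85, 0.75],
    }
    for name, ratings in forms.items():
        print(f"form {name}: {form_cut_score(ratings):.0%}")
```

Since each form is scored from its own items, any change in which items appear changes the resulting cut score, which is the difficulty that evaluating the whole database at once is meant to avoid.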


In the paper The Problem of the Saltatory Cut-Score: Some Issues and Recommendations for
Applying the Angoff to Test-item Banks (Coscarelli, Barrett, Kleeman, & Shrock, 2005), the item
database was divided into two groups: the first containing all items with an Angoff score less than
or equal to the median score and the second containing all items with scores greater than the
median score. The paper explains how item selections were made using simple randomization as
well as stratified randomization to generate a fair test. The stratified randomization described
alternates item selection among items equal to the median Angoff score, items less than or equal
to the median Angoff score, and items greater than the median score. This method guarantees an
assessment that is neither extremely difficult nor extremely easy. Their conclusions were:


• For low stakes tests, randomly sample within the item bank

• For medium stakes tests, one can probably sample within the bank if the distribution is
statistically normal, but stratification is safer

• For high stakes tests, one should consider stratification of the sample for increased precision


The methods and results described in the study indicate that random sampling/selection will
generally produce tests of approximately the same difficulty when drawn from a small sample, but as
the size of the sample (database) increases, some sort of stratified selection should be used. The
authors warn that as the size of the database increases, so does the chance for errors in selection.
Additionally, they recommend stratified random selection as the criticality of the assessment
increases. The stratified random selection method that I propose provides greater accuracy in
maintaining the established cut score and in presenting test-items at the same difficulty level,
and it selects the same number of items from multiple topics within the item database for each
iteration of the parallel assessment. The authors warn, and I concur, that the algorithm (or any
algorithm) used to select items is only as good as the quality of the Angoff weights (scores) that
have been assigned to each item.


My method, which I will call the Parry Method for simplicity, is based on the Angoff/Modified
Angoff Methods with the following differences (Note: There are several variants or adaptations
of the Modified Angoff Method in use throughout the testing community):


• The Angoff Method allows any score between 0 and 1 (0% and 100%)

  o Parry Method sets the floor score at the chance guess probability for the number of
  plausible alternatives available in a multiple-choice (MC) style item (e.g., a 4-alternative
  MC item floor score would be .25 (25%), a 3-alternative MC item floor score
  would be .33 (33%), etc.)

  o Parry Method sets the ceiling score at .95 (95%) for all items based on the assumption
  that if all test-takers (100%) who are considered to be at the MAC level of
  competence would answer the item correctly, the item is a ‘wasted’ item and does
  not provide much value in discrimination

• Angoff Method requires each judge to “take” the assessment as a typical test-taker would
to assign a ceiling score

  o Parry Method eliminates the requirement for the judges to “take” the assessment
  and sets the ceiling score for each item at .95 (95%). This is probably the most
  difficult task for the judges because they must put themselves in the mindset and
  knowledge level of the test-taker at whatever level they have agreed upon as the
  MAC

  o The reason for not having each judge “take” the test is the time required
  to answer all items in large databases and the fact that, if stratified random selection
  is used to generate each test, there is not a single “test” to take.

• Angoff Method assigns a weight or score to each item in a fixed form and sets a cut score
based on the average item scores of the panel of judges
  o Parry Method assigns a weight or score to each item in the test-item database (i.e.,
  if the database has 300 items, each item is evaluated individually without regard to
  what the final test form will look like) and establishes a cut score for the entire
  database as a whole
• Angoff Method generally allows the judges to rate the items at any score between 0 and 1
(0% - 100%)
  o Parry Method only allows the judges to rate the items at fixed intervals of .05, beginning
  with the floor score and ending at the .95 ceiling (.25, .30, .35,
  .40, .45, .50, .55, .60, .65, .70, .75, .80, .85, .90, .95)
• Angoff Method requires judges to disregard any alternative/distractor that even someone
at the MAC level would not choose before deciding their score
  o Parry Method assumes that all stems¹³ and alternatives were designed using recognized
  item development and review techniques so there should be no “wasted” distractors
  but requires the judges to comment on the plausibility¹⁴ of existing distractors
  and recommend changes if necessary
• Angoff Method allows the judges to come together after their initial individual scoring
sessions to discuss differences in their ratings and come to a consensus
  o Parry Method calculates the standard deviation (SD)¹⁵ among the judges' scores for each
  test-item and requires the judges to discuss their ratings as a group if the SD is 10
  or greater, in an attempt to reach a consensus that brings the SD below 10 if possible.
  If the judges cannot come to a consensus, the item is either retired or the judge(s)
  with the outlier score(s) is/are eliminated from the calculation.
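The rating constraints in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: ratings are held as whole percentages, the scale is the 4-alternative case (.25 to .95 in steps of .05), and the SD is the sample standard deviation; the text does not say whether the spreadsheet tool uses the sample or population formula. The function names are hypothetical.

```python
from statistics import stdev


def rating_scale(floor_pct=25, ceiling_pct=95, step_pct=5):
    """Ratings a judge may assign, from the floor to the ceiling in fixed steps."""
    return list(range(floor_pct, ceiling_pct + 1, step_pct))


def is_valid_rating(rating_pct, scale=None):
    """True if a single judge's rating sits on the permitted scale."""
    return rating_pct in (scale or rating_scale())


def needs_discussion(judge_ratings_pct, threshold=10.0):
    """True if the judges' ratings for one item have an SD of 10 or greater.

    Assumption: sample standard deviation over the panel's ratings.
    """
    return stdev(judge_ratings_pct) >= threshold


print(rating_scale())                          # 25, 30, ..., 95
print(needs_discussion([60, 65, 60, 70, 65]))  # tight agreement
print(needs_discussion([30, 60, 65, 90, 55]))  # wide disagreement
```

An item flagged by `needs_discussion` would go back to the panel; if no consensus is reached, the item is retired or the outlier rating(s) dropped, as described above.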


13 The stem of a test item presents a single definite and explicit question or problem statement.
14 A plausible alternative is one that has the appearance of being credible or believable, not frivolous.


15 The standard deviation is a measure of how spread out numbers are from each other.
ITEM DATABASES


With the proliferation of automated (computer) test-item databases and test generation software, 
many tests are most likely generated randomly, sometimes without regard to difficulty or coverage 
of required objectives. The most important step in designing test-item databases is the initial topic 
structure which must allow test-items to be stored with separation of topics, by objective or com- 
petency, and further down as necessary. This provides a mechanism for both stratified randomized 
selection of test-items as well as future psychometric analysis. A typical topic structure is illus- 
trated below: 


REPOSITORY NAME
    OBJECTIVE 1.0
        TOPIC 1.1
            SUB-TOPIC 1.1.1
                Test-item 1.1.1/1
                Test-item 1.1.1/2
        TOPIC 1.2
            SUB-TOPIC 1.2.1
            SUB-TOPIC 1.2.2
    OBJECTIVE 2.0
        TOPIC 2.1
            SUB-TOPIC 2.1.1
                Test-item 2.1.1/1
                Test-item 2.1.1/2
            SUB-TOPIC 2.1.2
                Test-item 2.1.2/1


The sub-topics can be further divided into three difficulty “buckets” to make selection of test-items 
at various levels of difficulty clear: 


SUB-TOPIC 1.1.1
    1.1.1 HARD
        Test-item 1.1.1/1
    1.1.1 MODERATE
        Test-item 1.1.1/5
    1.1.1 EASY
        Test-item 1.1.1/9


An alternative way to identify difficulty levels of individual test-items may be through the use of
metatags¹⁶ of Hard, Moderate, or Easy if the test-item database software supports this feature.
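As a rough illustration of the two organizing options just described (difficulty sub-topics or difficulty metatags), the sketch below stores each item with its sub-topic and a difficulty tag and then groups items into buckets that a selection routine could draw from. The class and field names are hypothetical, and the difficulty assigned to item 2.1.1/1 is invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BankItem:
    item_id: str     # e.g. "1.1.1/1"
    sub_topic: str   # e.g. "1.1.1"
    difficulty: str  # metatag: "HARD", "MODERATE", or "EASY"


def bucket_items(items):
    """Group item IDs by (sub_topic, difficulty) bucket."""
    buckets = defaultdict(list)
    for item in items:
        buckets[(item.sub_topic, item.difficulty)].append(item.item_id)
    return dict(buckets)


bank = [
    BankItem("1.1.1/1", "1.1.1", "HARD"),
    BankItem("1.1.1/5", "1.1.1", "MODERATE"),
    BankItem("1.1.1/9", "1.1.1", "EASY"),
    BankItem("2.1.1/1", "2.1.1", "EASY"),  # hypothetical difficulty
]

for bucket, ids in bucket_items(bank).items():
    print(bucket, ids)
```

Either storage choice leads to the same buckets; what matters for stratified selection is that every item can be addressed by both content area and difficulty level.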


16 Assigning a metatag is a way to index items by specific job tasks, knowledge, skills and abilities, difficulty, etc. to
allow for more flexible management and selection of test-items within a large database.
EXPERIMENTAL PROCEDURES


All experiments described in this paper were conducted using either hypothetical or real client
test-item data entered into a Questionmark® OnDemand® assessment platform
(www.questionmark.com) with test design generated by a proprietary spreadsheet tool designed by
James R. Parry, Owner/Chief Executive Manager at Compass Consultants, LLC
(www.gocompassconsultants.com). The methods described should work on any test development and
delivery platform capable of stratified randomization using either metatags or selection by sub-topic.


Basic description of the Questionmark OnDemand assessment platform: The platform can assess an
unlimited number of test-sitters from anywhere in the world. It provides a range of assessment
formats including ‘drag and drop’, ‘multiple choice’ and many more. Organizations can
conduct a range of assessments across different courses and ability ranges. Tests are automatically
marked. Results are instantly compiled. Trends and patterns are easy and quick to spot.


Design philosophy of the Compass Consultants, LLC spreadsheet tool: The tool, described
fully in Appendix A, is designed to assist in setting a cut score for an assessment based on the
results of a test-centered cut-score rating session that uses a panel of judges or experts to evaluate
the difficulty of each test item. It can be used to assist in the design of assessments using the correct
response P-value¹⁷ scores returned using classical test theory (CTT)¹⁸ or item response theory
(IRT)¹⁹ statistics. Additionally, it will determine the number of items from each section at each
level of difficulty (hard, moderate or easy) as set by the cut-score rating of each item. The tool
assumes 4-choice, multiple choice items with a floor of 25% and a ceiling of 95%.
The difficulty is then assigned by dividing the difference between 25% and 95% by 3 to
arrive at the three difficulty levels. The workbook is designed to accommodate up to ten (10)
reviewers on the rating panel. The totals required from each section are based upon the numbers of
each level of difficulty available in each section as well as the total number of items available. An
assumption is made that if there are fewer items available in any particular section(s) than in other
section(s), then that section is of less importance or has significantly fewer objectives. As data for
each item is entered in each section, the final test design worksheet is updated automatically.
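The three-way split of the 25% to 95% rating range can be sketched as follows. This is an illustration only: it assumes equal-width bands, that a lower rating means a harder item (fewer MAC-level test-takers expected to answer correctly), and that a rating exactly on a boundary falls into the easier band. The spreadsheet tool's exact boundary values and rounding are not given here, so ratings near a boundary may be labeled differently by the tool.

```python
def difficulty_bands(floor_pct=25.0, ceiling_pct=95.0):
    """Return the hard/moderate and moderate/easy boundaries (equal thirds)."""
    width = (ceiling_pct - floor_pct) / 3
    return floor_pct + width, floor_pct + 2 * width


def classify(rating_pct, floor_pct=25.0, ceiling_pct=95.0):
    """Label a rating HARD, MODERATE, or EASY using the equal-thirds assumption."""
    hard_max, moderate_max = difficulty_bands(floor_pct, ceiling_pct)
    if rating_pct < hard_max:
        return "HARD"
    if rating_pct < moderate_max:
        return "MODERATE"
    return "EASY"


low, high = difficulty_bands()
print(f"HARD below {low:.2f}, MODERATE below {high:.2f}, EASY at or above {high:.2f}")
print(classify(30), classify(60), classify(90))
```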


The tool is designed to establish an initial cut score for a new test-item database. I recommend
that, once the database is in use and an appropriate number of statistical correct response P-value
results have been reported, the recommended cut score be revisited and possibly adjusted using
another test-centered method based on actual scores, such as the Bookmark Method²⁰.


17 p-value is the percentage of test-takers who selected each response. Typically, the correct response p-value is 
referred to as the difficulty index. (Shrock & Coscarelli, 2007) 

18 Classical test theory (CTT), also known as the true score theory, refers to the analysis of test results based on test 
scores. 

19 Item response theory (IRT) is a statistical way to analyze responses to tests or questionnaires with the goal of
improving measurement of validity and reliability.

20 Bookmark Method places all test-items in a booklet, ordered from highest to lowest (easiest to hardest) correct
response P-value. Judges review each item and place a bookmark at the point where they feel the MAC will not answer
the next harder item correctly. This point, after discussion and averaging of all judges' selected break points,
becomes the cut score.
Experiment #1A — Random Selection — Hypothetical Data

Referring to the example in the introduction section of a test on electrical safety that covers four 
topics (Grounding, Lock-Out-Tag-Out, Personal Safety Equipment, Insulation), assume all test- 
items are placed in a single data file called Fairness Research. All four topics are assumed to be of 
equal importance so the final test design should test all topics equally. 


Using the Modified Parry Method of cut score setting, each item was reviewed²¹ and assigned a
difficulty rating with data entered into the spreadsheet tool (Table 4 — partial view).


[Table 4 image: cut score calculation tool (partial view). Legible annotation: "A standard deviation of more than 10 will trigger an alert. Discuss the outliers with the judges who set them to determine why."]

Table 4 - Experiment #1 - Cut Score Calculation Tool (partial view) showing hypothetical data
The cut score for the entire database of 60 items was determined to be 58.02%.


The spreadsheet tool recommended that 6.67 items at each difficulty level (hard, moderate, and 
easy) be used to ensure a cut score of approximately 58.02% was maintained. (Table 5) 


Table 5 — Experiment #1 - Recommended Test Difficulty Distribution for Hypothetical Data — Random Selection 


21 For this phase of the experiment all Angoff/Parry Method difficulty data was hypothetical and assigned to force
equal numbers of hard, moderate, and easy items.
Ignoring the recommendation of even distribution to maintain the cut score, the assessment design
function of Questionmark OnDemand was instructed to select 20 items at random from the topic
Fairness Research which contained all 60 available test-items (Figure 2).


Question selections 


20 random question(s) from topic "FAIRNESS RESEARCH" including subtopics (Avoid previously delivered)


Figure 2 - Questionmark OnDemand Item Selection Criteria for Hypothetical Data 


Thirty (n=30) iterations of a 20-question test were generated as illustrated in tables 6, 7 and 8.


Experiment #1 - Random Selection of 20 items from all 4 topics. Desired target difficulty is 58.02 with 5 items from each topic.
Table 6 - Experiment #1A - Item Selection - Attempts 1 - 10
Table 8 - Experiment #1A - Item Selection - Attempts 21 - 30
The item selection difficulty varied widely with each iteration (see ‘Difficulty’ columns in tables
6, 7, and 8). The target cut score/difficulty was 58.02. The randomization produced a difficulty
range between 52.67 and 65.63 with an average of 58.54. The standard deviation of the scores was
3.10 with a 95% confidence interval²² of 1.10849 which, based on the 30 samples, places the true
population mean between 56.9 and 59.13. The kurtosis²³ of the average difficulty is 0.308 and the
skewness²⁴ is 0.543. The number of items at each difficulty level from each topic varied with each
iteration. Table 9 provides a summary of the statistics for the sample. Figure 3 illustrates the
standard distribution curve of the sample.
[Table 9 - Sample Difficulty Statistics. Legible row labels: Target Difficulty/Cut Score, Mean Difficulty, Median, Minimum, Maximum, Variance Target vs. Mean, Standard Deviation all Averages, 95% Confidence Score]


Figure 3 - Standard Distribution of Test Difficulty - Experiment #1A - Random Selection 


The content (topic) coverage was erratic as illustrated by the four-color (delineated by sections) 
display and ‘Total From Topic’ in tables 6, 7, and 8. 


Conclusion: The topics were not covered equally in either difficulty or content.


22 A confidence interval is a range of values that you can be fairly sure contains the true mean of the population. 

23 Most often, kurtosis is measured against the normal distribution. If the kurtosis is close to 0, then a normal dis- 
tribution is often assumed. A low kurtosis indicates a lack of significant outliers. A high kurtosis indicates significant 
outliers. 

24 Skewness is usually described as a measure of a dataset’s symmetry, or lack of symmetry. A perfectly symmetrical
data set, such as a “normal” distribution, will have a skewness of 0. Negative skew indicates data
is skewed left and positive skew indicates data is skewed right when referring to the “tail”.
Experiment #1B — Stratified Randomization — Hypothetical Data
Using the same item data from experiment #1A, the items were divided into four topics within the
main topic and further subdivided into difficulty sub-sub topics based upon the results of the cut
score rating tool (Figure 4).

FAIRNESS RESEARCH
    1.0 TOPIC 1
        1.0 EASY
        1.0 HARD
        1.0 MODERATE

Figure 4 - Topic, Sub-topic Structure Used in Experiments


The final test design recommendations to maintain fairness were as shown in table 10.

Table 10 - Recommended Test Design - Stratified Randomization - Experiment #1B


Note: Each topic has a different cut score/difficulty rating but the overall database difficulty
remains at 58.02.

Following the recommended test design, the assessment design function of Questionmark
OnDemand was instructed to select 20 items in a stratified random fashion from the topic Fairness
Research, with five items from each topic, with equal numbers from each level of difficulty (Figure
5).


Question selections 


2 random question(s) from topic "FAIRNESS RESEARCH/1.0 TOPIC 1/1.0 EASY" excluding subtopics
2 random question(s) from topic "FAIRNESS RESEARCH/1.0 TOPIC 1/1.0 MODERATE" excluding subtopics
1 random question(s) from topic "FAIRNESS RESEARCH/1.0 TOPIC 1/1.0 HARD" excluding subtopics
2 random question(s) from topic "FAIRNESS RESEARCH/2.0 TOPIC 2/2.0 EASY" excluding subtopics
2 random question(s) from topic "FAIRNESS RESEARCH/2.0 TOPIC 2/2.0 HARD" excluding subtopics
1 random question(s) from topic "FAIRNESS RESEARCH/2.0 TOPIC 2/2.0 MODERATE" excluding subtopics
1 random question(s) from topic "FAIRNESS RESEARCH/3.0 TOPIC 3/3.0 EASY" excluding subtopics
2 random question(s) from topic "FAIRNESS RESEARCH/3.0 TOPIC 3/3.0 HARD" excluding subtopics
2 random question(s) from topic "FAIRNESS RESEARCH/3.0 TOPIC 3/3.0 MODERATE" excluding subtopics
2 random question(s) from topic "FAIRNESS RESEARCH/4.0 TOPIC 4/4.0 EASY" excluding subtopics
1 random question(s) from topic "FAIRNESS RESEARCH/4.0 TOPIC 4/4.0 HARD" excluding subtopics
2 random question(s) from topic "FAIRNESS RESEARCH/4.0 TOPIC 4/4.0 MODERATE" excluding subtopics


Figure 5 - Item Selection Criteria - Experiment #1B 
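The selection criteria in Figure 5 amount to a quota per topic and difficulty bucket. The sketch below shows one way such a plan could be applied outside the assessment platform; it is not the platform's implementation. The quotas are transcribed from Figure 5, while the bank contents (five placeholder item IDs per bucket, 60 items in total) and the function name are assumptions made for the example.

```python
import random

# Quotas transcribed from Figure 5: (topic, difficulty) -> items to draw.
PLAN = {
    ("TOPIC 1", "EASY"): 2, ("TOPIC 1", "MODERATE"): 2, ("TOPIC 1", "HARD"): 1,
    ("TOPIC 2", "EASY"): 2, ("TOPIC 2", "MODERATE"): 1, ("TOPIC 2", "HARD"): 2,
    ("TOPIC 3", "EASY"): 1, ("TOPIC 3", "MODERATE"): 2, ("TOPIC 3", "HARD"): 2,
    ("TOPIC 4", "EASY"): 2, ("TOPIC 4", "MODERATE"): 2, ("TOPIC 4", "HARD"): 1,
}


def stratified_select(bank, plan, rng):
    """Draw plan[bucket] items, without replacement, from every bucket."""
    selected = []
    for bucket, quota in plan.items():
        selected.extend(rng.sample(bank[bucket], quota))
    return selected


# Hypothetical bank: five placeholder item IDs in each of the twelve buckets.
bank = {
    bucket: [f"{bucket[0]}/{bucket[1]}/{i}" for i in range(1, 6)]
    for bucket in PLAN
}

form = stratified_select(bank, PLAN, random.Random(0))
print(len(form), "items selected")
```

Every generated form contains five items per topic and a fixed number of items at each difficulty level; only the specific items within each bucket change from one form to the next.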


Thirty (n=30) iterations of a 20-question test were generated as illustrated in tables 11, 12, and 13.


Experiment #1 - Directed Random Selection of 20 items from all 4 topics. Desired target difficulty is 58.02 with 5 items from each topic.


Table 11 - Experiment #1B Item Selections - Attempts 1 - 10


Table 13 - Experiment #1B Item Selections - Attempts 21 - 30 


The item selection difficulty remained relatively constant, as desired, with each iteration (see
‘Difficulty’ columns in tables 11, 12, and 13). The target cut score/difficulty was 58.02. The
stratified randomization produced a difficulty range between 56.57 and 62.42 with an average of
58.84. The standard deviation of the scores was 1.24 with a 95% confidence interval of 0.4456 which,
based on the 30 samples, places the true population mean between 58.39 and 59.29. The kurtosis of
the average difficulty is 0.987 and the skewness is 0.537. The number of items at each difficulty
level from each topic varied with each iteration. Table 14 provides a summary of the statistics for
the sample. Figure 6 illustrates the standard distribution curve of the sample.


[Table 14 - Sample Difficulty Statistics. Legible row labels: Variance Target vs. Mean, Standard Deviation all Averages, 95% Confidence Score]


Figure 6 - Standard Distribution of Test Difficulty - Experiment #1B - Stratified Random Selection 
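The 95% confidence figure reported for this experiment can be approximated with a short calculation. The sketch assumes the figure is a normal-based margin of error, z × SD / √n, which is what a typical spreadsheet confidence function returns; the text does not state which formula the tool uses. Because the inputs below are the rounded values quoted above (SD 1.24, mean 58.84, n = 30), the result matches the reported 0.4456 only approximately.

```python
from math import sqrt
from statistics import NormalDist


def ci_margin(sd, n, confidence=0.95):
    """Normal-based margin of error for a sample mean (assumed formula)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sd / sqrt(n)


margin = ci_margin(sd=1.24, n=30)
low, high = 58.84 - margin, 58.84 + margin
print(f"margin = {margin:.4f}, interval = {low:.2f} to {high:.2f}")
```

The interval is centered on the sample mean, so a smaller standard deviation, as produced by stratified selection, directly narrows the range in which the true mean difficulty is expected to lie.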


The content (topic) coverage was equal as stratified for each iteration as illustrated by the four- 
color (delineated by sections) display and ‘Total From Topic’ columns in tables 11, 12, and 13. 


Conclusion: All topics were covered equally as desired in both difficulty and content. Comparing 
the distribution of test difficulty scores in figures 3 and 6 shows that the stratified randomization 
consistently produced tests well within an acceptable range to meet the desired cut score of 58.02. 
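The contrast between the two selection methods can also be reproduced in a small simulation. The sketch below is not the experiment itself: the item ratings are hypothetical (the same five ratings per difficulty level in each of four topics), the per-bucket quotas are those of Figure 5, and each method generates 30 forms of 20 items. It compares the standard deviation of form difficulty under simple and stratified random selection.

```python
import random
from statistics import mean, stdev

# Hypothetical ratings (percent) for the five items in each difficulty bucket.
LEVEL_RATINGS = {
    "HARD": [30, 35, 40, 40, 45],
    "MODERATE": [50, 55, 60, 65, 70],
    "EASY": [75, 80, 85, 90, 95],
}
# Per-topic quotas (EASY, MODERATE, HARD), transcribed from Figure 5.
QUOTAS = {"TOPIC 1": (2, 2, 1), "TOPIC 2": (2, 1, 2),
          "TOPIC 3": (1, 2, 2), "TOPIC 4": (2, 2, 1)}
LEVELS = ("EASY", "MODERATE", "HARD")

BANK = {(t, lvl): list(LEVEL_RATINGS[lvl]) for t in QUOTAS for lvl in LEVELS}
ALL_ITEMS = [r for ratings in BANK.values() for r in ratings]  # 60 items


def simple_form(rng):
    """Mean difficulty of 20 items drawn at random from the whole bank."""
    return mean(rng.sample(ALL_ITEMS, 20))


def stratified_form(rng):
    """Mean difficulty of 20 items drawn bucket by bucket according to QUOTAS."""
    picked = []
    for topic, quotas in QUOTAS.items():
        for level, quota in zip(LEVELS, quotas):
            picked.extend(rng.sample(BANK[(topic, level)], quota))
    return mean(picked)


def spread(make_form, iterations=30, seed=1):
    """Standard deviation of form difficulty over repeated form generation."""
    rng = random.Random(seed)
    return stdev([make_form(rng) for _ in range(iterations)])


print("simple SD:    ", round(spread(simple_form), 2))
print("stratified SD:", round(spread(stratified_form), 2))
```

With ratings like these, the stratified forms vary only through the choice of items inside each bucket, whereas the simple random forms also vary in how many hard, moderate, and easy items they contain, which is why their spread is larger.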
Experiment #2A — Random Selection — Real Client #1 Data
Note: Client Test-item QIDs replaced to protect confidentiality 


Using cut score (Parry Method) results from real client data (Tables 15, 16, & 17 — partial views), 
the items were divided into three topics within the main topic and further subdivided into difficulty 
sub-sub topics based upon the results of the cut score rating tool. The cut score/difficulty for the 
entire database (71 items) was determined to be 76.13% by averaging all three topic cut scores.

Table 16 - Experiment #2A - Difficulty Calculations - Real Data - Topic 2

Table 17 - Experiment #2A - Difficulty Calculations - Real Data - Topic 3


Topic 1 consisted of 18 test-items with a difficulty rating of 78 (Easy). Topic 2 consisted of 33 
test-items with a difficulty rating of 74 (Moderate). Topic 3 consisted of 20 test-items with a dif- 
ficulty rating of 77 (Easy). Each topic had a mix of hard, moderate, and easy items. (Table 18) 


[Table 18 legible column headings: Topic Cut Score & Difficulty, Items in Topic, % of Total Items, Available Hard]


Table 18 - Experiment #2 - Item Difficulty Distribution by Topic 


Referring to the design philosophy of the Compass Consultants spreadsheet tool as to the number 
of items drawn from each section, it appears that topic 2 was considered to be the ‘most’ important 
with 46.48% of the available items, topic 3 was the next ‘most’ important with 28.17% and topic 
1 was the least important with 25.35%. 
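The percentages quoted above follow directly from the item counts per topic. A minimal sketch of the arithmetic, using the counts reported for this client (18, 33, and 20 items out of 71); the function name is hypothetical:

```python
def topic_shares(items_per_topic):
    """Percentage of the whole bank held by each topic, rounded to 2 decimals."""
    total = sum(items_per_topic.values())
    return {
        topic: round(100 * count / total, 2)
        for topic, count in items_per_topic.items()
    }


client_1 = {"Topic 1": 18, "Topic 2": 33, "Topic 3": 20}
print(topic_shares(client_1))
```

The tool treats a larger share of the bank as a sign of greater importance, so these shares drive how many of the 20 test items are drawn from each topic.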


The final test design to maintain fairness in both content and difficulty is shown in table 19. 


Table 19 - Experiment #2 - Recommended Test Design

Ignoring the recommendation of item distribution to maintain the cut score, the assessment design
function of Questionmark OnDemand was instructed to select 20 items at random from the single
topic containing all of the test-items. Tables 20, 21, and 22 present the results.


Experiment #2 - Random Selection of 20 items from all 3 topics. Real Client Data. Desired target difficulty is 76.13


Table 21 - Experiment #2A - Item Selection - Attempts 11 - 20

Table 22 - Experiment #2A - Item Selection - Attempts 21 - 30


The item selection difficulty varied widely with each iteration (see ‘Difficulty’ columns in tables
20, 21, and 22). The target cut score/difficulty was 76.13. The randomization produced a difficulty
range between 73.00 and 79.95 with an average of 75.87. The standard deviation of the scores was
2.17 with a 95% confidence interval of 0.778 which, based on the 30 samples, places the true
population mean between 75.1 and 76.64. The kurtosis of the average difficulty is -0.527 and the
skewness is 0.613. The number of items at each difficulty level from each topic varied with each
iteration. Table 23 provides a summary of the statistics for the sample. Figure 7 illustrates the
standard distribution curve of the sample.


[Table 23 legible row labels: Standard Deviation all Averages, 95% Confidence Score, Kurtosis, Skewness]


Table 23 - Difficulty Statistics for Experiment #2A

[Figure 7 chart title: Standard Distribution of Experiment #2 Samples, Random Data]


Figure 7 - Standard Distribution of Test Difficulty - Experiment #2A - Random Selection 
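The skewness and kurtosis values reported for each experiment can be computed from the 30 form difficulties with the formulas sketched below. The paper does not state which formulas the spreadsheet tool uses; this sketch assumes the bias-corrected sample versions commonly found in spreadsheet software, with kurtosis expressed as excess kurtosis so that a normal distribution scores close to 0. The input lists are made-up examples, not experiment data.

```python
from math import sqrt


def _mean_and_sd(values):
    """Mean and sample standard deviation (n - 1 denominator)."""
    n = len(values)
    m = sum(values) / n
    s = sqrt(sum((x - m) ** 2 for x in values) / (n - 1))
    return m, s


def sample_skewness(values):
    """Bias-corrected sample skewness (assumed formula); needs n >= 3."""
    n = len(values)
    m, s = _mean_and_sd(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)


def sample_excess_kurtosis(values):
    """Bias-corrected sample excess kurtosis (assumed formula); needs n >= 4."""
    n = len(values)
    m, s = _mean_and_sd(values)
    fourth = sum(((x - m) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


symmetric = [1, 2, 3, 4, 5]      # made-up, perfectly symmetric
right_tailed = [1, 1, 1, 2, 10]  # made-up, long right tail
print(sample_skewness(symmetric), sample_excess_kurtosis(symmetric))
print(sample_skewness(right_tailed))
```

A positive skewness, as reported for this experiment, means the longer tail of form difficulties lies on the high side of the mean.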


The content (topic) coverage was erratic as illustrated by the three-color (delineated by sections) 
display and ‘Total From Topic’ columns in tables 20, 21, and 22. 


Conclusion: The topics were not covered equally in either difficulty or content.
Experiment #2B — Stratified Randomization — Real Client #1 Data

Using the same data from experiment #2A, following the recommended test design, the assessment 
design function of Questionmark OnDemand was instructed to select 20 items in a stratified ran- 
dom fashion from the sub-topics, following the recommended design shown in table 19 (figure 8). 


Question selections 
4 random question(s) from topic "FAIRNESS RESEARCH 2/1.0 TOPIC 1/1.0 EASY" excluding subtopics (Avoid previously delivered)
1 random question(s) from topic "FAIRNESS RESEARCH 2/1.0 TOPIC 1/1.0 MODERATE" excluding subtopics (Avoid previously delivered)
6 random question(s) from topic "FAIRNESS RESEARCH 2/2.0 TOPIC 2/2.0 EASY" excluding subtopics (Avoid previously delivered)
3 random question(s) from topic "FAIRNESS RESEARCH 2/2.0 TOPIC 2/2.0 MODERATE" excluding subtopics (Avoid previously delivered)
1 random question(s) from topic "FAIRNESS RESEARCH 2/2.0 TOPIC 2/2.0 HARD" excluding subtopics (Avoid previously delivered)
4 random question(s) from topic "FAIRNESS RESEARCH 2/3.0 TOPIC 3/3.0 EASY" excluding subtopics (Avoid previously delivered)
1 random question(s) from topic "FAIRNESS RESEARCH 2/3.0 TOPIC 3/3.0 MODERATE" excluding subtopics (Avoid previously delivered)


Figure 8 - Item Selection Criteria - Experiment #2B 


Thirty (n=30) iterations of a 20-question test were generated as illustrated in tables 24, 25, and 26.


Experiment #2 - Directed Random Selection of 20 items from all 3 topics. Real Client Data. Desired target difficulty is 76.13.
Table 24 - Experiment #2B Item Selections - Attempts 1 - 10


Table 26 - Experiment #2B Item Selections - Attempts 21 - 30 


The item selection difficulty remained constant as stratified with each iteration (see ‘Difficulty’
columns in tables 24, 25, and 26). The target cut score/difficulty was 76.13. The stratified
randomization produced a difficulty range between 73.00 and 75.76 with an average (mean) of 74.11.
The standard deviation of the scores was 0.74 with a 95% confidence interval of 0.2635 which,
based on the 30 samples, places the true population mean between 73.85 and 74.37. The kurtosis of
the average difficulty is 0.117 and the skewness is 0.579. The number of items at each difficulty
level from each topic varied with each iteration. Table 27 provides a summary of the statistics for
the sample. Figure 9 illustrates the standard distribution curve of the sample.


Sample Difficulty Statistics
Target Cut Score                   76.13
Mean Difficulty                    74.11
Median                             73.98
Minimum                            73.00
Maximum                            75.76
Variance Target vs. Mean           2.04
Standard Deviation all Averages    0.74
95% Confidence Score               0.263545877
Kurtosis                           0.117166773
Skewness                           0.579229905


Figure 9 - Standard Distribution of Test Difficulty - Experiment #2B - Stratified Random Selection 


The content (topic) coverage was equal as stratified for each iteration as illustrated by the
three-color (delineated by sections) display and ‘Total From Topic’ columns in tables 24, 25, and
26.

Conclusion: All topics were covered equally as desired in both difficulty and content. Comparing
the distribution of test difficulty scores in figures 7 and 9 shows that the stratified randomization
consistently produced tests well within an acceptable range to meet the desired cut score of 76.13.
Although the simple randomization in experiment 2A produced tests with an average (mean)
difficulty (75.87) closer to the desired difficulty (76.13) than the stratified randomization in
experiment 2B (74.11), the standard deviation of the results in experiment 2B (0.74) indicated
significantly less variance from attempt to attempt than the standard deviation produced by
experiment 2A (2.17).
Experiment #3A — Random Selection — Real Client #2 Data

Using cut score (Parry Method) results from real client data (Tables 28, 29, and 30 — partial views), 
the items were divided into three topics within the main topic and further subdivided into difficulty 
sub-sub topics based upon the results of the cut score rating tool. The cut score/difficulty for the 
entire database (134 items) was determined to be 64.37% by averaging all three topic cut scores. 


[Cut score tool images (partial views). Legible annotation: "A standard deviation of more than 10 will trigger an alert. Discuss the outliers with the judges who set them to determine why."]


Table 29 - Experiment #3A - Difficulty Calculations - Real Data - Topic 2

Table 30 - Experiment #3A - Difficulty Calculations - Real Data - Topic 3


Topic 1 consisted of 42 test-items with a difficulty rating of 58 (Moderate). Topic 2 consisted of
52 test-items with a difficulty rating of 69 (Moderate). Topic 3 consisted of 40 test-items with a
difficulty rating of 66 (Moderate). Each topic had a mix of hard, moderate, and easy items (Table
31).
[Table 31 legible fragments: column headings Items in Topic, % of Total Items, Available; one row showing 40 items, 29.85%]


Table 31 - Experiment #3 - Item Difficulty Distribution by Topic

Referring to the design philosophy of the Compass Consultants spreadsheet tool as to the number
of items drawn from each section, it appears that topic 2 was considered to be the ‘most’ important
with 38.81% of the available items, topic 1 was the next ‘most’ important with 31.34% and topic
3 was the least important with 29.85%.


The final test design to maintain fairness in both content and difficulty is shown in table 32. 


Table 32 - Experiment #3 - Recommended Test Design

Ignoring the recommendation of item distribution to maintain the cut score, the assessment design
function of Questionmark OnDemand was instructed to select 20 items at random from the single
topic containing all of the test-items. Tables 33, 34, and 35 present the results.


Experiment #3A - Random Selection of 20 items from all 3 topics. Real Client Data. Desired target difficulty is 64.37

Table 34 - Experiment #3A - Item Selection - Attempts 11 - 20

Table 35 - Experiment #3A - Item Selection - Attempts 21 - 30


The item selection difficulty varied widely with each iteration (see ‘Difficulty’ columns in tables
33, 34, and 35). The target cut score/difficulty was 64.37. The randomization produced a difficulty
range between 59.71 and 67.54 with an average of 64.15. The standard deviation of the scores was
2.44 with a 95% confidence interval of 0.878 which, based on the 30 samples, places the true
population mean between 63.27 and 65.03. The kurtosis of the average difficulty is -0.859 and the
skewness is -0.403. The number of items at each difficulty level from each topic varied with each
iteration. Table 36 provides a summary of the statistics for the sample. Figure 10 illustrates the
standard distribution curve of the sample.


[Table 36 legible row labels: Variance Target vs. Mean, Standard Deviation all Averages, 95% Confidence Score]


Table 36 - Difficulty Statistics for Experiment #3A

[Figure 10 chart title: Standard Distribution of Experiment #3 Samples, Random Data]


Figure 10 - Standard Distribution of Test Difficulty - Experiment #3A - Random Selection 


The content (topic) coverage was erratic as illustrated by the three-color (delineated by sections) 
display and ‘Total From Topic’ columns in tables 33, 34, and 35. 


Conclusion: The topics were not covered equally in either difficulty or content. Although the
average (mean) difficulty of the sample is within an acceptable range of the desired difficulty/cut
score, the standard deviation indicates a significant variation among attempts.


Experiment #3B — Stratified Randomization — Real Client #2 Data 
Using the same data from experiment #3A, following the recommended test design, the assessment
design function of Questionmark OnDemand was instructed to select 20 items in a stratified random
fashion from the sub-topics, following the recommended design shown in table 32 (figure 11).


Question selections 
1 random question(s) from topic "FAIRNESS RESEARCH 3/1.0 TOPIC 1/1.0 HARD" excluding subtopics (Avoid previously delivered)
4 random question(s) from topic "FAIRNESS RESEARCH 3/1.0 TOPIC 1/1.0 MODERATE" excluding subtopics (Avoid previously delivered)
1 random question(s) from topic "FAIRNESS RESEARCH 3/1.0 TOPIC 1/1.0 EASY" excluding subtopics (Avoid previously delivered)
1 random question(s) from topic "FAIRNESS RESEARCH 3/2.0 TOPIC 2/2.0 HARD" excluding subtopics (Avoid previously delivered)
4 random question(s) from topic "FAIRNESS RESEARCH 3/2.0 TOPIC 2/2.0 MODERATE" excluding subtopics (Avoid previously delivered)
3 random question(s) from topic "FAIRNESS RESEARCH 3/2.0 TOPIC 2/2.0 EASY" excluding subtopics (Avoid previously delivered)
1 random question(s) from topic "FAIRNESS RESEARCH 3/3.0 TOPIC 3/3.0 HARD" excluding subtopics (Avoid previously delivered)
3 random question(s) from topic "FAIRNESS RESEARCH 3/3.0 TOPIC 3/3.0 MODERATE" excluding subtopics (Avoid previously delivered)
2 random question(s) from topic "FAIRNESS RESEARCH 3/3.0 TOPIC 3/3.0 EASY" excluding subtopics (Avoid previously delivered)


Figure 11 - Item Selection Criteria - Experiment #3B 
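
The selection criteria in figure 11 amount to drawing a fixed number of items, without replacement, from each topic/difficulty folder. A minimal sketch of that idea is shown below. It is not Questionmark OnDemand's implementation: the function `select_stratified`, the placeholder item IDs, and the ten-items-per-folder bank are all hypothetical, and the "Avoid previously delivered" option is not modeled. Only the quotas are taken from figure 11.

```python
import random

# Quotas transcribed from figure 11: (topic, difficulty) -> items to draw.
QUOTAS = {
    ("TOPIC 1", "HARD"): 1, ("TOPIC 1", "MODERATE"): 4, ("TOPIC 1", "EASY"): 1,
    ("TOPIC 2", "HARD"): 1, ("TOPIC 2", "MODERATE"): 4, ("TOPIC 2", "EASY"): 3,
    ("TOPIC 3", "HARD"): 1, ("TOPIC 3", "MODERATE"): 3, ("TOPIC 3", "EASY"): 2,
}


def select_stratified(bank, quotas, rng):
    """Draw the quota for each stratum at random, without replacement."""
    selected = []
    for stratum, count in quotas.items():
        pool = bank[stratum]
        if len(pool) < count:
            raise ValueError(f"not enough items in stratum {stratum}")
        selected.extend(rng.sample(pool, count))
    return selected


# Hypothetical bank: ten placeholder item IDs in every topic/difficulty folder.
bank = {s: [f"{s[0]}-{s[1]}-{i}" for i in range(10)] for s in QUOTAS}
test_form = select_stratified(bank, QUOTAS, random.Random(0))
print(len(test_form), "items selected")
```

Because the quotas are fixed, every generated form contains 6 items from topic 1, 8 from topic 2, and 6 from topic 3, whichever individual items are drawn.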
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Thirty (n=30) iterations of a 20-question test were generated as illustrated in tables 37, 38, and 39. 


Experiment 3B - Directed Random Selection of 20 items from all 3 topics. Real Client Data. Desired target difficulty is 64.37


Total From Topic
Topic 1: 6
Topic 2: 8
Topic 3: 6


Table 38 - Experiment #3B Item Selections - Attempts 11 - 20 
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Table 39 - Experiment #3B Item Selections - Attempts 21 - 30 


The item selection difficulty remained consistent, as stratified, with each iteration (see
‘Difficulty’ columns in tables 37, 38, and 39). The target cut score/difficulty was 64.37. The
stratified randomization produced a difficulty range between 61.93 and 64.75 with an average (mean)
of 63.13. The standard deviation of the scores was 0.76, and the 95% confidence margin of 0.274
across the 30 samples places the true population mean between 62.86 and 63.40. The kurtosis of the
average difficulty is 0.127 and the skewness is 0.351. The number of items at each difficulty level
from each topic was fixed by the selection criteria in figure 11 and did not vary between
iterations. Table 40 provides a summary of the statistics for the sample. Figure 12 illustrates the
standard distribution curve of the sample.


Sample Difficulty Statistics 
Target Cut Score 
Mean difficulty 


Standard Deviation all Averages 
95% Confidence Score 


Table 40 - Difficulty Statistics for Experiment #3B 
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Figure 12 - Standard Distribution of Test Difficulty - Experiment #3B - Stratified Random Selection 


The content (topic) coverage was equal as stratified for each iteration as illustrated by the four- 
color (delineated by sections) display and ‘Total From Topic’ columns in tables 37, 38, and 39. 


Conclusion: All topics were covered equally as desired in both difficulty and content. Comparing 
the spread of test difficulty scores in figures 10 and 12 shows that the stratified randomization 
consistently produced tests well within an acceptable range to meet the desired cut score of 64.37. 
Although the simple randomization in experiment 3A produced tests with an average (mean) dif- 
ficulty (64.15) closer to the desired difficulty (64.37) than the stratified randomization in experi- 
ment 3B (63.13), the standard deviation of the results in experiment 3B (0.76) indicated signifi- 
cantly less variance from attempt to attempt than the standard deviation produced by experiment 
3A (2.44). 
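
The contrast between experiments 3A and 3B can be illustrated with a small simulation. This is a sketch under stated assumptions, not a re-run of the experiments: the item bank is synthetic (uniform ratings inside three hypothetical bands), topics are ignored, and the per-band quotas (3 hard, 11 moderate, 6 easy) are simply the difficulty totals from figure 11.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic bank: 60 difficulty ratings (percent correct) per hypothetical band.
BANDS = {"HARD": (25, 48), "MODERATE": (49, 71), "EASY": (72, 95)}
bank = {band: [rng.uniform(lo, hi) for _ in range(60)]
        for band, (lo, hi) in BANDS.items()}
all_items = [d for items in bank.values() for d in items]

# Difficulty totals from figure 11 (topic dimension omitted in this sketch).
QUOTA = {"HARD": 3, "MODERATE": 11, "EASY": 6}


def simple_random_form():
    """Mean difficulty of 20 items drawn with no regard to difficulty."""
    return statistics.mean(rng.sample(all_items, 20))


def stratified_form():
    """Mean difficulty of 20 items drawn to a fixed quota per band."""
    picked = []
    for band, count in QUOTA.items():
        picked.extend(rng.sample(bank[band], count))
    return statistics.mean(picked)


random_means = [simple_random_form() for _ in range(30)]
stratified_means = [stratified_form() for _ in range(30)]

sd_random = statistics.stdev(random_means)
sd_stratified = statistics.stdev(stratified_means)
print(f"simple random SD: {sd_random:.2f}")
print(f"stratified SD:    {sd_stratified:.2f}")
```

Fixing the number of items per band removes the between-band component of the variation, which is why the stratified forms cluster more tightly around their mean, mirroring the 0.76 versus 2.44 standard deviations reported above.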


CONCLUSIONS 


Key factors in maintaining defensibility and fairness of assessments are being able to justify the
cut or passing score and maintaining validity. In that regard, each iteration of a parallel
assessment or test must be parallel in both difficulty and topic coverage; otherwise, unfair bias
is introduced each time an assessment is generated. Conclusions drawn from the experiments
described in this paper are as follows:


1. If test items are selected at random from a test-item database, without regard to either dif- 
ficulty or subject matter, the resulting assessments will be inconsistent in coverage of topics 
and difficulty. This is evidenced by the results of experiments 1A, 2A, and 3A. 

a. Selection of the number of test-items from each topic was inconsistent with each 
iteration of the randomly generated assessments. 
i. Experiment 1A tables 6, 7, and 8 
ii. Experiment 2A tables 20, 21, and 22
iii. Experiment 3A tables 33, 34, and 35 
b. Each item in the test-item database was empirically assigned a difficulty score ranging
from .25 (25%) to .95 (95%) using the Parry Method (a variation of the Angoff
Method). Each item was then classified as Easy, Moderate, or Hard based upon the
score range determined by dividing the entire score range into thirds (see table A-1).
The “cut” score for the entire database was determined by averaging the item scores.
This cut score was the target for each iteration of the assessment.


Copyright © 2020 by James R. Parry 


i. The mean (average) difficulty or cut score for each experiment of randomly 
generated assessments was within one point of the target cut score, and ac- 
tually closer to the desired cut score than the mean (average) using strati- 
fied-random selection (table 41). 

ii. The standard deviation among iterations was higher than that of the assessments
generated using stratified-random selection, which translates to a wider spread or
variance of actual difficulties among iterations. This is evidenced by the
comparison of the standard score distribution plots (Figure 13).


2. If test items are selected from a test-item database, using stratified-random selection by
difficulty and subject matter, the resulting assessments will be consistent in coverage of
topics and difficulty. This is evidenced by the results of experiments 1B, 2B, and 3B.


a. Selection of the number of test-items from each topic was consistent with each
iteration of the stratified-random generated assessments. The selection of the number of
items from each topic was a forced selection based upon the number of items available
in each topic vs. the desired test length. The philosophy behind this selection
is described in Appendix A.
i. Experiment 1B tables 6, 7, and 8
ii. Experiment 2B tables 20, 21, and 22
iii. Experiment 3B tables 37, 38, and 39
b. As previously stated, each item in the test-item database was empirically assigned
a difficulty score and placed into appropriate topic and difficulty folders.

i. Stratified-random selection, based upon difficulty as well as topic, consistently
produced assessments within several points of the target cut score. The
assessment difficulty in relation to the target cut score varied slightly with
each iteration.

1. Experiment 1B tables 6, 7, and 8
2. Experiment 2B tables 20, 21, and 22
3. Experiment 3B tables 37, 38, and 39
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ii. Although the difference between the mean (average) difficulty or cut score 
and the target cut score for each experiment of stratified-random generated 
assessments was slightly more than that of the randomly generated assess- 
ments, (table 41), the standard deviation among iterations was lower which 
translates to a smaller spread or variance of actual difficulties among itera- 
tions. This is evidenced by the comparison of the standard score distribution 
plots (Figure 13). 


Target Difficulty (Cut-Score) vs. Actual Results

Experiment   Target Cut Score   Random Selection   Stratified-Random Selection
1            58.21              58.54              58.84
2            NO32               75.87              74.11
3            64.73              64.15              63.13
Table 41 - Target Difficulty (Cut-Score) vs. Actual Results 
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Figure 13 - Comparison of Standard Distribution of Difficulty - Random Selection vs. Stratified-Random Selection 
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RECOMMENDATIONS 


In order to maintain fairness and ensure tests are valid, reliable, without bias and defensible when 
generated from a test-item database: 


Test-items must be constructed using universally recognized standards 

Cut scores should be established using a recognized test-centered method or, if appropriate, 
a test-taker centered method, because arbitrary methods are not defensible 

Each item in a test-item database should be evaluated by a panel of expert judges and a 
difficulty score or rating established based upon the agreed upon MAC level of the target 
test-taker 

Test items should be selected using stratified randomization based upon both topic cover- 
age as well as item difficulty to ensure equitable parallel assessments are generated 

Tests should not be generated in a pure random fashion from a test-item database without 
regard to content because content coverage will be erratic 

Tests should not be generated in a pure random fashion from a test-item database without
regard to difficulty of test-items because difficulty among tests will be erratic
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APPENDIX A 


Description of the spreadsheet tool: 

The spreadsheet tool is designed to be used in conjunction with almost any test-centered cut score
setting method that produces a difficulty rating for individual test-items. It has been used very
successfully in conjunction with the Modified Angoff method²⁵ and is designed primarily for
4-alternative multiple-choice²⁶ test-items with allowable rating scores between .25 and .95. A
version has also been developed for use with 2 through 7-alternative items with allowable rating
scores adjusted to suit the number of alternatives, with a minimum rating score of 14.2 for a
7-alternative item (table A-1). There are two schools of thought on the maximum rating ceiling:
1.00 or .95 (100% or 95%). The original Angoff Method allowed any value from .25 through 1.00 and
only asked the judges to consider one individual. I chose a variation of the Modified Angoff
Method (Parry Method) which asks the judges to consider 100 test-takers who are considered to
be minimally competent (in the subject being tested) and, for simplicity, restricts the judges to
preset values (.25, .30, .35, .40, .45, .50, .55, .60, .65, .70, .75, .80, .85, .90, .95). The
value 1.0 (100%) is eliminated because an item that is responded to correctly by a minimally
competent examinee 100% of the time is considered an unnecessary test-item in most cases.


Difficulty Values Based on Number of Alternatives


50-65 
33 - 53.8 
25 - 48.3 


ae 
6 


16.7 - 42.8 
14.2 - 41.2 


Table A- 1 - Difficulty Values Based on Number of Alternatives 
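
The ranges that survive in table A-1 are consistent with splitting the span between a floor and the .95 ceiling into thirds, with the lowest third being the hard band. The sketch below makes that reading explicit. It rests on two assumptions that the text does not state outright: that the floor is the chance-level score (100 divided by the number of alternatives) and that the surviving column of the table is the hard band. The function name `difficulty_bands` is hypothetical.

```python
def difficulty_bands(alternatives, ceiling=95.0):
    """Split the allowable rating range into Hard, Moderate, and Easy thirds."""
    floor = 100.0 / alternatives   # assumed: chance-level score for a blind guess
    width = (ceiling - floor) / 3.0
    return {
        "HARD": (floor, floor + width),
        "MODERATE": (floor + width, floor + 2 * width),
        "EASY": (floor + 2 * width, ceiling),
    }


for n in range(2, 8):
    low, high = difficulty_bands(n)["HARD"]
    print(f"{n} alternatives: hard band {low:.1f} - {high:.1f}")
```

For 4-alternative items this gives a hard band from 25 to roughly 48.3, matching the table; the other rows agree with the table to within rounding of the floor value.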


The description of the spreadsheet function will use examples from the 4-alternative tool. Both
tools function the same, except that the 2 through 7-alternative tool requires an entry by
the user to identify the number of alternatives to ensure proper difficulty calculations. The tool
has not been optimized for multiple-response²⁷ item types but does allow for the mixing
of multiple-choice items with different numbers of alternatives. It is recommended that mixing
test-items with varying numbers of alternatives within the database be limited and used with
extreme caution, as the results of stratified random selection may generate assessments outside of
the desired difficulty range or present an unbalanced mix of items with varying numbers of
alternatives.


25 A well-established method involving three basic steps: conceptualizing the borderline examinee, identifying spe- 
cific test-items, and using expert judges to estimate what percentage of borderline examinees should respond cor- 
rectly. 

26 A multiple-choice test-item only allows the test-taker to select one alternative as the correct/incorrect choice.

27 A multiple-response test-item allows the test-taker to select more than one alternative as correct/incorrect
responses.
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A-1 


Design philosophy of the Compass Consultants, LLC spreadsheet tool: The tool is designed
to assist in setting a cut score for an assessment based on the results of a test-centered cut-score
rating session. Additionally, it will determine the number of items from each section at each level
of difficulty (Hard, Moderate or Easy) as set by the cut-score rating of each item. The difficulty
levels assume 4-choice, multiple-choice items with a floor of 25% and a ceiling of 95%: the
difference between 25% and 95% is divided by 3 to arrive at the three difficulty levels. The
workbook is designed to accommodate up to ten (10) reviewers on the rating panel. The ratings
assigned to each test-item by the individual judges are averaged and a difficulty score is assigned
to the test-item. If the judges’ individual ratings produce a standard deviation²⁸ of 10 or
greater, the item is flagged for discussion among the judges, who either come to a consensus by
modifying their ratings, retire the item as a ‘bad’ item, or leave it as written and rated. A
final test design is proposed after all items in all topics have been rated. The totals required
from each section are based upon the numbers of each level of difficulty available in each section
as well as the total number of items available. An assumption is made that if there are fewer items
available in any particular section(s) than in other section(s), then that section is of less
importance or has significantly fewer objectives. As data for each item is entered in each section,
the final test design worksheet will be updated automatically.
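
The per-item rating logic described above (average the judges' ratings, flag disagreement at a standard deviation of 10 or greater, and assign a difficulty level by thirds of the 25% to 95% range) might be sketched as follows. The spreadsheet's actual formulas are not given in the text, so the use of the sample standard deviation and the treatment of band boundaries (lower bound inclusive) are assumptions, and `rate_item` is a hypothetical name.

```python
import statistics

FLOOR, CEILING = 25.0, 95.0
THIRD = (CEILING - FLOOR) / 3.0
SD_FLAG = 10.0   # flag threshold stated in the text


def rate_item(judge_ratings):
    """Average the judges' ratings, flag disagreement, and assign a band."""
    mean = statistics.mean(judge_ratings)
    sd = statistics.stdev(judge_ratings)   # assumed: sample standard deviation
    if mean < FLOOR + THIRD:
        band = "HARD"        # few minimally competent test-takers answer correctly
    elif mean < FLOOR + 2 * THIRD:
        band = "MODERATE"
    else:
        band = "EASY"
    return {"difficulty": mean, "sd": sd, "flag": sd >= SD_FLAG, "band": band}


# Hypothetical ratings (percent correct) from six judges for two items.
agreed = rate_item([60, 65, 60, 55, 65, 60])
disputed = rate_item([30, 70, 45, 85, 50, 40])
print(agreed)
print(disputed)
```

In this example both items land in the moderate band, but only the second is flagged, because its ratings are spread widely enough to call for discussion among the judges.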


Function of the Spreadsheet Tool 

The tool is designed to accommodate 20 topics with 200 test-items per topic. The number of topics
and test-items can be expanded upon request to Compass Consultants, LLC. The tool will allow
for the input from up to 10 judges. Typically, 6 to 8 judges are sufficient; fewer than three do
not produce reliable ratings, and more than 10 make it difficult to arrive at a consensus. Refer to
table A-2 as the spreadsheet tool is described.


• As all of the heading information is entered by the facilitator, each of the individual
worksheets populates automatically.

• Generally, the Enter * If New, R if Retired block is left blank initially. As items are
reviewed during the rating session, some may be omitted. If an item has NEVER been
presented on an assessment and is omitted during the review, all data can be omitted and
the row left blank. If additional items are added to the database in the future they should
be indicated with an ‘*’ for easy identification. It is not usually necessary to re-evaluate all
items, but be aware that the section/topic difficulty as well as the cut score may change. If an
item that is currently in use on an assessment is subsequently retired from use, the worksheet
row should be updated with an ‘R’. This will remove any cut score/difficulty calculations
from the tool, and the section/topic difficulty as well as the cut score may change. To
maintain defensibility for possible future challenges this data should remain in the permanent
assessment documentation.

• Each test-item is given a Question Identifier (Test-item QID) in the test-item database
before a rating session begins.

• As each expert (judge) ‘score’ is entered for each item the tool is updated in real time:


28 Standard Deviation (σ) is a measure of how spread out numbers are.
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A-2 


Difficulty Metatag presents the difficulty as easy, moderate, or hard and the color
changes as appropriate with green indicating ‘easy’, yellow indicating ‘moderate’,
and red indicating ‘hard’ based upon the judges’ average score

Average Percentage Correct (Angoff Rating) for the item is updated to reflect 
the running average score 

Standard Deviation is updated. If the standard deviation is 10 or greater the block 
displays a red flag to facilitate location during the discussion phase of the rating 
session 

The Topic Cut Score block updates to reflect the current cut score/difficulty for 
the topic 

A running total of the number and percentage of items at each difficulty level is 
displayed in a block to the lower right. 


[Screenshot of the rating worksheet: one rating column per expert, each with a name field.
On-screen note: “A standard deviation of more than 20 will trigger an alert. Discuss the
outliers with the judges who set them to determine ...”]

Table A- 2 - Cut Score Calculation Tool Data Entry and Display
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As data is entered, descriptive statistics are calculated for each judge’s ratings as illustrated by
table A-3.


1 
2 
3 
4 
5 
6 
7 
8 
9 
10 


Calculated Cut Score (Mean)
Standard Error of the Mean for Entire Section
Average Standard Deviation for Unit
Number of Items Retired (not counted)

The standard error of the mean, also called the standard deviation of the mean, is used to
estimate the standard deviation of a sampling distribution. The smaller the error, the more
reliable the measurement.


Table A- 3 - Judges' Descriptive Statistics 


The descriptive statistics are copied for each topic/section to a separate worksheet
“CONSOLIDATED DESCRIPTIVE STATISTICS” that displays all of the descriptive statistics for the
entire workbook on one worksheet (Table A-4, partial view).
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Standard Error of the Mean for Entire Section (repeated for each section shown)


The standard error of the mean, also called the standard deviation of the mean, is used to
estimate the standard deviation of a sampling distribution. The smaller the error, the more
reliable the measurement.


Table A- 4 - Consolidated Descriptive Statistics Sample 
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In addition to the consolidated view of the descriptive statistics, a table comparing the judges’ 
ratings for the entire database is presented (Table A-5). This table is useful to determine if indi- 
vidual judges have typically scored higher or lower than the rest of the group and whether consid- 
eration should be given to disregard that judge’s scores. 


Judge | Judge's Mean Rating | Deviation From Assessment Cut Score | Judge's Standard Error of the Mean
4.29 0.37 
0.87 0.08 
4.94 0.43 
3.05 0.26 
0.63 0.05 
5.46 0.47 
2.21 0.19 


1 
2 
3 
4 
5 
6 
7 
8 
9 


10


NOTE: These calculations assume that the judge has 
contributed input to all topics being analyzed. If the judge 
did not contribute, their average rating will not be 
included in this overall rating table - refer to the 
individual section descriptive statistics for their 
information. 


Table A- 5 - Judges' Comparative Ratings 
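
The comparison in table A-5 can be approximated with a short calculation of each judge's mean rating, its deviation from the overall cut score, and the judge's standard error of the mean. The spreadsheet's formulas are not reproduced in the text, so this is only a sketch: it assumes every judge rated every item (as the note above requires), reports signed deviations (the table may show magnitudes only), and uses hypothetical ratings and a hypothetical function name, `judge_comparison`.

```python
import math
import statistics


def judge_comparison(ratings_by_judge):
    """Compare each judge's mean rating with the overall cut score."""
    all_ratings = [r for ratings in ratings_by_judge.values() for r in ratings]
    cut_score = statistics.mean(all_ratings)   # equals the mean of item means
    rows = {}                                  # when all judges rate all items
    for judge, ratings in ratings_by_judge.items():
        mean = statistics.mean(ratings)
        sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
        rows[judge] = {"mean": mean, "deviation": mean - cut_score, "sem": sem}
    return cut_score, rows


# Hypothetical ratings for three items from three judges.
ratings = {
    "Judge 1": [60, 70, 80],
    "Judge 2": [50, 60, 70],
    "Judge 3": [70, 80, 90],
}
cut_score, rows = judge_comparison(ratings)
for judge, row in rows.items():
    print(judge, row)
```

A judge whose deviation is consistently large relative to the others is the kind of outlier the table is meant to expose.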


The results of each of the worksheets are consolidated on the “FINAL TEST DESIGN” worksheet
(Table A-6), which displays totals and percentages of items at each difficulty level for each topic
and presents a recommended test design that maintains the projected cut score and ensures that
every iteration of the assessment generated in a stratified-random format is equal in content
and content difficulty. Each column is explained below unless it is self-explanatory.
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i 


Desired
Test Size


17.43 The Checksum to the left will alert you if the selected value does not match the desired test size.


Approximate Test Time in Minutes Based on Item Difficulty


Table A- 6 - Final Test Design Display 


• Column D - displays the percentage of items from each topic in relation to the total items
available.

• Columns E through J - display the total number and percentage of items at each level of
difficulty in each topic. NOTE: Percentages are rounded for display.

• Column K - presents the total number of items required from each topic to maintain the
percentages shown in column D for the desired test size entered in the “Set Desired Test
Size” block at the bottom of column F.

• Columns L, N, and P - present the recommended number of items from each topic at each
level of difficulty to be used to maintain the desired cut score as well as topic coverage.
These numbers are not rounded, to allow the test designer to make informed decisions as
to how many items to enter in columns M, O, and Q.
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• Columns M, O, and Q - The test designer enters whole numbers of items as close to the
recommended numbers as possible while referring to the “Checksum” block below the
desired test size block at the bottom of column F. If the number of items selected does not
match the test size, the checksum block will alert the designer by changing color (Figure A-1).
The designer must then make an informed decision as to which topic(s) to subtract items from
or add items to in order to ensure topic equity.


Set 
Desired 
Test Size 


Figure A- 1 - Test Design Sheet Checksum Warning 


• The note at the bottom of columns J through R explains a warning that will appear in the
“Total needed from topic” block(s) if there is not a sufficient quantity of items available
to generate a fair test of the size desired.

• The box labeled “Approximate Test Time in Minutes Based on Item Difficulty” is
calculated using rule of thumb values for the time it takes a test-taker to respond to a test item
based upon its difficulty (see table 3).
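
The column calculations described in the list above can be sketched as a proportional allocation. The text does not give the spreadsheet's formulas, so the rule used here for columns L, N, and P (split each topic's total by the share of its items at each difficulty level) is an assumption, as are the function names. The bank counts are hypothetical and were chosen so that the proportions happen to reproduce the quotas in figure 11; they are not the client's actual counts.

```python
def propose_design(bank_counts, test_size):
    """Return unrounded per-topic and per-level item recommendations."""
    total_items = sum(sum(levels.values()) for levels in bank_counts.values())
    design = {}
    for topic, levels in bank_counts.items():
        topic_items = sum(levels.values())
        share = topic_items / total_items              # column D
        needed = topic_items * test_size / total_items  # column K
        design[topic] = {
            "share": share,
            "needed": needed,
            # columns L, N, P (assumed rule): split by level share within topic
            "recommended": {lvl: needed * n / topic_items
                            for lvl, n in levels.items()},
        }
    return design


def checksum_ok(chosen, test_size):
    """True when the designer's whole-number entries add up to the test size."""
    return sum(sum(levels.values()) for levels in chosen.values()) == test_size


# Hypothetical bank counts per topic and difficulty level.
bank_counts = {
    "Topic 1": {"HARD": 10, "MODERATE": 40, "EASY": 10},
    "Topic 2": {"HARD": 10, "MODERATE": 40, "EASY": 30},
    "Topic 3": {"HARD": 10, "MODERATE": 30, "EASY": 20},
}
design = propose_design(bank_counts, test_size=20)
# Columns M, O, Q: the designer's whole-number entries (here, simple rounding).
chosen = {topic: {lvl: round(v) for lvl, v in d["recommended"].items()}
          for topic, d in design.items()}
print(chosen, checksum_ok(chosen, 20))
```

When the recommended values are not whole numbers, simple rounding can leave the total above or below the test size, which is the situation the checksum block is meant to catch.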


Note: The spreadsheet tool is available for licensing. Send request to:
info@gocompassconsultants.com
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